Enriching SMT Training Data via Paraphrasing

نویسندگان

  • Wei He
  • Shiqi Zhao
  • Haifeng Wang
  • Ting Liu
چکیده

This paper proposes a novel method to resolve the coverage problem of SMT system. The method generates paraphrases for source-side sentences of the bilingual parallel data, which are then paired with the target-side sentences to generate new parallel data. Within a statistical paraphrase generation framework, we employ an object function, named Sentence Novelty, to select paraphrases which having the most novel information to the bilingual training corpus of the SMT model. Meanwhile, the context is considered via a language model in the source language to ensure the fluency and accuracy of paraphrase substitution. Compared to a state-of-the-art phrase based SMT system (Moses), our method achieves an improvement of 1.66 points in terms of BLEU on a small training corpus which simulates a resource-poor environment, and 1.06 points on a training corpus of medium size.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving Chinese Sentence Polarity Classification via Opinion Paraphrasing

While substantial studies have been achieved on sentiment polarity classification to date, lacking enough opinion-annotated corpora for reliable t rain ing is still a challenge. In this paper we propose to improve a supported vector machines based polarity classifier by enriching both training data and test data via opinion paraphrasing. In particular, we first extract an equivalent set of attr...

متن کامل

Combining Multiple Resources to Improve SMT-based Paraphrasing Model

This paper proposes a novel method that exploits multiple resources to improve statistical machine translation (SMT) based paraphrasing. In detail, a phrasal paraphrase table and a feature function are derived from each resource, which are then combined in a log-linear SMTmodel for sentence-level paraphrase generation. Experimental results show that the SMT-based paraphrasing model can be enhan...

متن کامل

Improved Statistical Machine Translation Using Paraphrases

Parallel corpora are crucial for training SMT systems. However, for many language pairs they are available only in very limited quantities. For these language pairs a huge portion of phrases encountered at run-time will be unknown. We show how techniques from paraphrasing can be used to deal with these otherwise unknown source language phrases. Our results show that augmenting a stateof-the-art...

متن کامل

CLARIFY: Human-Powered Training of SMT Models

We present CLARIFY, an augmented environment that aims to improve the quality of translations generated by phrase-based statistical machine translation systems through learning from humans. CLARIFY employs four types of knowledge input: 1) direct input 2) results from either a word alignment game, 3) an phrase alignment game, or 4) a paraphrasing game. All of these knowledge inputs elicit user ...

متن کامل

Train the Machine with What It Can Learn - Corpus Selection for SMT

Statistical machine translation relies heavily on available parallel corpora, but SMT may not have the ability or intelligence to make full use of the training set. Instead of collecting more and more parallel training corpora, this paper aims to improve SMT performance by exploiting the full potential of existing parallel corpora. We first identify literally translated sentence pairs via lexic...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011